Jan 31, 2004 2:31pm
Texts in PDF
I was talking to Michael Lesk at the DIAL '04
conference, and it got me interested in the
Million Books project.
I took one of the books (Early Jazz) and tried
converting it to PDF using Acrobat 6 with JBIG2
"lossy" compression on the TIF file. That turned
the 88 MB .tif file into a 7.34 MB .pdf.
If you OCR the pages so there is text as well
as images in the same file, you get 16.7 MB.
This compares to 33.9 MB for DjVu.
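
For a rough sense of scale, the ratios between those sizes can be worked
out in a few lines. This is only a back-of-the-envelope sketch using the
sizes quoted in this post; the names (SIZES_MB, ratio) are made up for
illustration and are not part of any tool mentioned here.

```python
# File sizes in MB for the "Early Jazz" scan, as quoted in the post.
# The dictionary keys are illustrative labels, not real file names.
SIZES_MB = {
    "tif_original": 88.0,
    "pdf_jbig2": 7.34,
    "pdf_jbig2_ocr": 16.7,
    "djvu": 33.9,
}


def ratio(larger_mb, smaller_mb):
    """Return how many times larger the first size is than the second."""
    return larger_mb / smaller_mb


tif_to_pdf = ratio(SIZES_MB["tif_original"], SIZES_MB["pdf_jbig2"])
tif_to_ocr_pdf = ratio(SIZES_MB["tif_original"], SIZES_MB["pdf_jbig2_ocr"])
djvu_to_ocr_pdf = ratio(SIZES_MB["djvu"], SIZES_MB["pdf_jbig2_ocr"])

print(f"TIF vs. JBIG2 PDF:        {tif_to_pdf:.1f}x")       # 12.0x
print(f"TIF vs. JBIG2 PDF + OCR:  {tif_to_ocr_pdf:.1f}x")   # 5.3x
print(f"DjVu vs. JBIG2 PDF + OCR: {djvu_to_ocr_pdf:.1f}x")  # 2.0x
```

So the image-only PDF is about a twelfth of the original TIF, and the
PDF with OCR text is about half the size of the DjVu file.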
I imagine these only work with Acrobat 6.
Using "reduce file size" with JBIG2 also cleans up
the text images a bit, since it reduces noise.
Anyway, I'm interested in helping if someone wants
to pursue this; let me know.